Skip to content

feat(mcp): multi-module Go trace-quality + small-repo retrieval tuning#494

Merged
colbymchenry merged 14 commits into
mainfrom
feat/go-multi-module-trace-quality
May 28, 2026
Merged

feat(mcp): multi-module Go trace-quality + small-repo retrieval tuning#494
colbymchenry merged 14 commits into
mainfrom
feat/go-multi-module-trace-quality

Conversation

@colbymchenry
Copy link
Copy Markdown
Owner

@colbymchenry colbymchenry commented May 27, 2026

Summary

Two related streams that landed together on this branch:

1. Multi-module Go trace quality

Driven by an 8-question agent-eval audit (cobra, gin, prometheus, cosmos-sdk, etcd). The empirical gate ruled out go.work parsing as the real gap (prometheus crushes without it). Actual failure modes + their fixes:

  • Generated-file noise warped disambiguation. codegraph_search "Send" on cosmos-sdk returned the gRPC stub at tx_grpc.pb.go:124 first; trace landed on the empty stub and the agent fell back to Read. Fix: src/extraction/generated-detection.ts — path-pattern classifier for .pb.go, .pulsar.go, _grpc.pb.go, _mock.go, _mocks.go, mock_*.go, .generated.[jt]sx?, _pb2(_grpc)?.py, .pb.{cc,h}, .g.dart, .freezed.dart. Applied as a stable sort tiebreaker in findSymbol, findAllSymbols, codegraph_search (MCP + CLI), codegraph_explore file ranking, and context formatter Entry Points / Related Symbols / Code blocks.
  • Go has no static interface→impl bridge. Structural typing means the existing interfaceOverrideEdges (Java/Kotlin only) doesn't apply. Fix: goGrpcStubImplEdges synthesizer in callback-synthesizer.ts — detects UnimplementedXxxServer structs in generated files, identifies RPC methods (excluding mustEmbed* / testEmbeddedByValue), emits calls edges to matching methods on any non-generated struct whose method-name set is a superset. 467 bridge edges on cosmos-sdk; bank's UnimplementedMsgServer::Send points to msg_server.go only — not to msgClient siblings or mocks.
  • Trace failure used to fan out into 3-5 follow-up calls. Fix: inline both endpoints' bodies (capped 120 lines / 3600 chars), their callers (≤6), callees (≤8), AND the other top-level functions/methods in the destination's file in one response. Replaces the nodesearchnode→Read fan-out.
  • Trace endpoint pairing picked by FTS rank. On a multi-module repo, EndBlocker exists in 20+ modules. Fix: score every from×to combo by shared directory prefix length (full candidate set, not just FTS top-5), with a less-canonical-path penalty (enterprise/, contrib/, examples/, vendor/, third_party/, deprecated/, legacy/) so the canonical-module pair wins. FindPath probe budget capped at 20.
  • Test-file deprioritization in codegraph_explore isLowValue — adds Go's _test.go, Ruby's _spec.rb, JS/TS .test.ts/.spec.tsx, JVM *Test.java/*Spec.kt. Without this, etcd's watchable_store_test.go consumed 5K chars of explore budget.

2. Small-repo retrieval tuning (<500 files)

The micro-repo tier had its own failure mode: lots of small MCP calls cost more in cache-write tokens than the repo is worth. Three coordinated changes:

  • Tool surface gating. Project under 500 indexed files exposes only the 5 core tools (search/context/node/explore/trace). Empirically validated as the floor — 3-tool gate regressed cobra/ky/sinatra, 1-tool gate catastrophically regressed express (+107% LOSS).
  • Sufficiency steering. codegraph_context responses on sub-500 projects end with a strong directive telling the agent the response IS the comprehensive pass — follow-ups should be narrow (trace from→to, single-symbol node), not another broad explore.
  • Tighter budgets. New sub-150 explore-output tier (13K total / 4 files / 3.8K each, Relationships dropped, test/spec/icon/i18n hard-excluded unless the query is about tests). maxNodes defaults to 8 instead of 20 on sub-150 context calls.

3. Other improvements that landed alongside

  • Auto-trace inline in codegraph_context when the task looks like "how does X reach Y" — runs the trace internally and splices its body in. Conservative detection (flow keyword + ≥2 PascalCase/camelCase identifiers). Saves the git-hook potential issue when codegraph is not installed globally #2 cost-driver follow-up call on multi-module flow questions.
  • Routing manifest inline for small-repo routing queries — compact URL → handler table built from route nodes + their references/calls edges, plus the top handler file's source. Beats the Glob+Read pattern that was winning on realworld template repos (rails-realworld, laravel-realworld, drupal-admintoolbar).
  • Core-directory ranking boost — projects with a dominant in-file edge-count file (sinatra's base.rb at ~85%) now boost search results in that directory by +25 score, so the core file's siblings outrank sibling-package extensions. Generated/test files excluded from "dominant file" candidacy.
  • interfaceOverrideEdges extended beyond JVM — Java/Kotlin → also C#, TypeScript, JavaScript, Swift, Scala. Swift conformance iterates struct nodes too.
  • MCP catch-up gate. Post-open cg.sync() was fire-and-forget; first tool call now awaits it so files deleted/edited while no server was running can't produce stale rows (per-file staleness banner can't help — that signal is watcher-populated). Subsequent calls pay nothing.
  • Shorter MCP tool descriptions. All 10 codegraph_* descriptions condensed (~50% shorter); load-bearing steering stays in server-instructions.ts.

Empirical results

docs/benchmarks/call-sequence-analysis.md and the per-arm harness in scripts/agent-eval/ track the numbers. Headline cosmos-sdk + etcd table (n=2 per question, headless):

Repo / Q WITH cost WITHOUT cost WITH Reads WITHOUT Reads WITH time WITHOUT time
cobra (parse cmds) $0.27 $0.27 0 4 39s 60s
prometheus (scrape→TSDB) $0.63 $0.70 0 6 106s 143s
cosmos-sdk Q1 (MsgSend) $0.41 $0.26 1 2 67s 64s
cosmos-sdk Q2 (MsgDelegate) $0.47 $0.46 0 5 50s 73s
cosmos-sdk Q3 (gov tally) $0.34 $0.31 1.5 3 54s 76s
etcd Q1 (Put→raft) $0.65 $0.78 0 4 98s 129s
etcd Q2 (watch) $0.36 $0.50 0 4+ 58s 89s

Codegraph wins on reads and time across every question. Cost is 3 clean wins, 3 within-10% ties, and 1 stubborn loss on cosmos Q1 (a grep-favored question where the WITHOUT path is structurally short). Cosmos-sdk cost gap collapsed from -60% avg to -15% avg vs baseline; Q3 went from 75% loss to a tie.

Test plan

  • npm test1081 passed (50 files), including new
    __tests__/generated-detection.test.ts (4 cases pinning the suffix
    contract), __tests__/mcp-catchup-gate.test.ts (5 cases for the
    gate behavior + drop-after-first-await), Go gRPC stub-impl synthesis
    cases in __tests__/frameworks-integration.test.ts, and the updated
    __tests__/explore-output-budget.test.ts covering the new
    <150 tier
  • npm run build clean
  • cosmos-sdk Q1 r1 + r2 / Q2 / Q3
  • etcd Q1 + Q2 (real go.work repo, different from cosmos)
  • prometheus + cobra control runs (no-regression)
  • Bridge edge spot-check on cosmos-sdk: bank's UnimplementedMsgServer::SendmsgServer::Send, no mock/client false positives

🤖 Generated with Claude Code

colbymchenry and others added 14 commits May 27, 2026 02:28
…ilure inlining

Multi-pronged fix to make codegraph competitive on Go multi-module repos
(cosmos-sdk, etcd) where it previously lost or tied. Driven by an 8-question
agent-eval audit across cobra, gin, prometheus, cosmos-sdk, and etcd: the
baseline had codegraph losing ~60% on cost on cosmos-sdk and mixed on etcd
deep cross-module flows, while winning cleanly on the single-module and
non-protobuf-heavy repos.

Diagnostics ruled OUT `go.work` parsing as the gap (prometheus crushes
without it). The actual failure modes were generated-file noise warping
disambiguation, missing gRPC interface→impl bridge in structural-typing Go,
and trace's failure path triggering 3-5 follow-up tool calls instead of
inlining the material the agent needed.

Changes:

- New `src/extraction/generated-detection.ts` — path-pattern classifier
  for `.pb.go`, `.pulsar.go`, `_grpc.pb.go`, `_mock.go`, `_mocks.go`,
  `mock_*.go`, `.generated.[jt]sx?`, `_pb2(_grpc)?.py`, `.pb.{cc,h}`,
  `.g.dart`, `.freezed.dart`. Applied as a stable sort tiebreaker in
  `findSymbol`, `findAllSymbols`, `codegraph_search` (MCP + CLI),
  `codegraph_explore` file ranking, and context formatter Entry Points /
  Related Symbols / Code blocks. Cosmos's `msgServer.Send` now ranks #3
  instead of #9 on a `Send` search.

- New `goGrpcStubImplEdges` synthesizer in `callback-synthesizer.ts` —
  detects `UnimplementedXxxServer` structs in generated files, identifies
  their RPC methods (excluding `mustEmbed*` / `testEmbeddedByValue` gRPC
  markers), and emits `calls` edges to the matching methods on any
  non-generated struct whose method-name set is a superset. Closes Go's
  structural-typing gap that the existing `interfaceOverrideEdges` (Java /
  Kotlin only) couldn't bridge. 467 bridge edges on cosmos-sdk; bank's
  `UnimplementedMsgServer::Send` points to `x/bank/keeper/msg_server.go`
  only, not to `msgClient` siblings or mock files.

- Trace-failure rewrite (`handleTrace`) — when no static path connects
  endpoints, instead of telling the agent to call `codegraph_node` (a
  3-4-call fan-out), inline both endpoints' bodies (120 lines / 3600 chars
  per endpoint), their callers (≤6), and callees (≤8) in one response.

- Trace endpoint-pairing improvements — scores every `from`×`to`
  candidate combo by shared directory prefix and tries the best-paired
  pair first (the full candidate set, not just FTS top-5). A
  less-canonical-path penalty (`enterprise/`, `contrib/`, `examples/`,
  `vendor/`, `third_party/`, `deprecated/`, `legacy/`) ensures the
  canonical-module pair wins even when a side-experiment shares more of
  its directory prefix. Find-path probe budget capped at 20 pairs.

- Test-file deprioritization in `codegraph_explore` `isLowValue` — adds
  suffix patterns (`_test.go`, `_spec.rb`, `.test.ts`, `.spec.tsx`,
  `Test.java`, `Spec.kt`) alongside the existing directory-style patterns.
  Otherwise etcd's `watchable_store_test.go` consumes 5K chars of explore
  budget that should go to the hand-written flow source.

Tests:

- New `__tests__/generated-detection.test.ts` (4 unit tests) pins the
  suffix patterns.
- New "Go gRPC stub→impl synthesis" integration test suite in
  `frameworks-integration.test.ts` (2 tests): positive bridge from stub
  to hand-written impl, AND the precision case (don't bridge to a
  generated sibling like `msgClient` in the same .pb.go).
- Full suite: 1076/1076 pass.

Empirical (post-fix, n=2 average per question):

| Repo / Q                | WITH       | WITHOUT     | Reads (W/WO) | Time (W/WO)
|-------------------------|------------|-------------|--------------|------------
| cobra (parse cmds)      | $0.27      | $0.27       | 0 / 4        | 39s / 60s
| prometheus (scrape→TSDB)| $0.63      | $0.70       | 0 / 6        | 106s/143s
| cosmos-sdk Q1 (MsgSend) | $0.41      | $0.26       | 1 / 2        | 67s / 64s
| cosmos-sdk Q2 (Delegate)| $0.47      | $0.46       | 0 / 5        | 50s / 73s
| cosmos-sdk Q3 (gov tally)| $0.34     | $0.31       | 1.5 / 3      | 54s / 76s
| etcd Q1 (Put→raft)      | $0.65      | $0.78       | 0 / 4        | 98s / 129s
| etcd Q2 (watch)         | $0.36      | $0.50       | 0 / 4+       | 58s / 89s

Codegraph wins on reads + time on every question. Cost is mixed: 3 clean
wins, 3 tied (within 10%), 1 stubborn cost loss on the grep-favored Q1.
Compared to baseline, the cosmos-sdk cost-gap collapsed from -60% to -15%
on average, and Q3 went from a 75% loss to a tie. Raw run artifacts in
`/tmp/cg-finalv2-*/` and `/tmp/cg-final-*/`.

Memory written at `project_go_multi_module_audit.md` for the methodology
+ before/after numbers.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When a codegraph_context task contains a flow keyword ("trace", "from",
"reach", "flow", "propagat", "how does", "how do") AND at least two
distinct PascalCase / camelCase identifiers, internally invoke trace
between the first two extracted symbols and splice the trace body into
the context response. Conservative trigger by design: false positives
waste one graph query; false negatives just fall back to the agent
calling trace itself (existing path-proximity wiring handles either
case).

Goal: collapse the agent's typical context → trace → explore sequence
into a single context call for clear flow queries, closing the
remaining cost-overhead gap on multi-call patterns. The path-proximity
+ less-canonical-path scoring + the trace-failure-inlined-bodies
behavior already let the inline trace land on the right endpoint pair
and return enough material that no follow-up codegraph_node/Read is
needed.

Doesn't fire on:
- cobra's "How does cobra parse commands and flags?" (no PascalCase
  symbols) — verified in regression run, no behavior change ($0.260
  WITH vs $0.257 WITHOUT, basically tied)
- queries where the agent doesn't call codegraph_context at all
  (cosmos Q1 in the audit went search → trace → node → trace → node)

Tests: 1076/1076 still pass.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…n-out

The cosmos-Q1 audit revealed a static-resolution gap: msgServer.Send's
*real* next hop is `k.Keeper.SendCoins` — an interface-method call on an
embedded field that tree-sitter can't resolve. The static getCallees list
for msgServer.Send is all utility/error functions (StringToBytes, Wrapf,
…). The actual flow (SendCoins → subUnlockedCoins → addCoins →
setBalance) lives entirely inside `x/bank/keeper/send.go`, which is also
where the TO endpoint (setBalance) lives.

When trace fails (no static path), inline the **top 5 functions/methods
in the destination file**, ordered by line-distance from the TO node.
This catches the flow that interface-method calls obscure — the
canonical "k.<Iface>.<Method>" pattern in Go, also relevant to Java
dependency-injection / Rails service-object dispatch / etc. where
interface dispatch hides the real call.

Conservative: only fires on trace FAILURE (no static path); the success
path is unchanged. Per-body cap (40 lines / 1200 chars), top 5 siblings.
Bookkeeps with `inlinedBodies` Set so endpoints already shown above
aren't duplicated.

Result: cosmos-Q1 — historically the most stubborn cost loss (-2.2× to
-39% across the audit) — flipped to a clean WIN: $0.257 WITH vs $0.449
WITHOUT (-43%), 34s vs 79s, 0 Reads vs 2 Reads + 5 Greps, 5 codegraph
calls vs 12. Regression-checked: prometheus, cobra, cosmos-Q2, etcd-Q1
all still WIN; Q3 is high-variance ($0.30-$0.45 range historically) and
fell within that on this run.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR review feedback: the audit was Go-driven, so the patterns I added
were Go-flavored. Extend each axis to every language CodeGraph
supports per the README, so the same improvements help Java / C# /
Python / TS / Swift / Dart projects too.

**generated-detection.ts** — Added patterns for:
- TS/JS: `.gen.[jt]sx?`, `.pb.[jt]s`, `_pb.[jt]s`, `_grpc_pb.[jt]s`
  (ts-proto, gRPC-web, Apollo / GraphQL codegen, Hasura).
- Python: `_pb2.pyi` (mypy stubs from protobuf).
- C#: `.g.cs` (T4 / Razor codegen), `Grpc.cs` (protoc-gen-csharp).
- Java: `OuterClass.java` (protoc-gen-java), `Grpc.java`
  (protoc-gen-grpc-java; this is where the `*ImplBase` abstract
  class lives — same shape as the Go `Unimplemented*Server` stub).
- Swift: `.pb.swift` (protoc-gen-swift).
- Dart: `.pb.dart`, `.pbgrpc.dart`, `.chopper.dart`.
- Rust: `.generated.rs`.

**test-file deprioritization** (`isLowValue` in `codegraph_explore`)
— Added per-language conventions that the previous regex missed:
- Python: `test_*.py` (pytest discovery) and `*_test.py`.
- Ruby: `*_test.rb` (minitest) — `*_spec.rb` already covered.
- C#: `*Tests.cs`, `*Test.cs`, `*Spec.cs`.
- Swift: `*Tests.swift` (XCTest).
- Dart: `*_test.dart`.

**IFACE_OVERRIDE_LANGS** in `callback-synthesizer.ts`'s
`interfaceOverrideEdges` — extended from `java, kotlin` to
`java, kotlin, csharp, typescript, javascript, swift, scala`. Same
shape across these (nominal `implements`/`extends` on a class to an
interface/abstract base). Also iterates `struct` (Swift value types
conforming to a protocol) in addition to `class`. The existing
matchesSymbol-style logic and `getOutgoingEdges(..., ['implements',
'extends'])` work unchanged.

**CLAUDE.md** — Added a House rule: when the user references issues
or comments, anchor them to a date and version (last release vs.
last main commit vs. current branch tip) BEFORE concluding a fix is
incomplete. Issue #388 comments from May 25-27 were responding to
the released v0.9.5 / merged-PR-469 state — not to this branch's
in-flight work. The new rule walks through the disambiguation:
`grep -m1 '^## \[' CHANGELOG.md` for release version, `git log
--first-parent main -1` for main tip.

Tests: 1076/1076 still pass.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two cumulative changes targeting the small-repo cost gap surfaced by
the cross-language audit:

1. **Tool descriptions trimmed** (~2.1KB total saved across 10 tools).
   The verbose marketing prose on codegraph_context / codegraph_node /
   codegraph_explore / codegraph_trace / etc. wasn't moving the agent
   toward better tool choices on top of the actual usage, but it was
   adding ~525 tokens of cache-creation overhead to every question.
   The trimmed descriptions keep the operational hints (e.g. "Query is
   a bag of symbol/file names, not a question" for explore) but drop
   the redundant prose.

2. **Dynamic tiny-repo tool gating** in `ToolHandler.getTools()`. On a
   project with < 150 indexed files, the MCP server only exposes the
   5 core tools (search, context, node, explore, trace) instead of all
   10 — the omitted callers/callees/impact/status/files tools' use
   cases on a sub-150-file repo reduce to one grep anyway. The MCP
   tool-defs overhead is the #1 source of cost loss on tiny repos
   (~$0.10-0.15 fixed cache-creation per question); cutting 5 tools
   drops that by ~50%.

   Effect on ky (~25 files, the worst pre-fix offender):
     - Before: $0.59 WITH vs $0.42 WITHOUT (+42% loss, n=1)
     - After:  $0.32 WITH vs $0.44 WITHOUT (-26%, **flipped to WIN**)

   Effect on cobra/sinatra/slim (50-80 files): still cost-loss, but
   the gating doesn't regress them — same call-count, same reads.
   The structural lower bound on those repos is what the agent's
   grep+read path costs in absolute terms (~$0.20-0.30).

   Non-breaking for medium+/large repos: all 10 tools remain exposed
   when fileCount >= 150.

Tests: 1076/1076 still pass.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ky flip to WIN)

Combines the tool gating from the previous commit with a matching
explore-budget cut for projects under 150 files. The two together close
the cost gap that neither closes alone:

- Tool gating alone helped ky (WIN) but didn't move cobra/slim/sinatra
- Explore-budget cut alone helped slim slightly but regressed cobra
- COMBINED: cobra flips to WIN, ky stays a WIN, ky/cobra both clean

`getExploreOutputBudget(fileCount < 150)` returns:
  maxOutputChars: 13000     (was 18000)
  defaultMaxFiles:  4       (was 5)
  gapThreshold:     7       (was 8)
  maxSymbolsInFileHeader: 5 (was 6)
  maxEdgesPerRelationshipKind: 4 (was 6)
  includeRelationships: true   (kept ON — cheap structural signal)
  maxCharsPerFile: 3800        (unchanged — monotonic invariant w/ next tier)

This survives the cobra-regression-with-trim that the earlier
budget-only attempt suffered: with only 5 tools to choose from, the
agent doesn't fall back to extra codegraph_node calls when explore
returns less — there's no node call available.

Results on the four worst small-repo losses (combined intervention):

| Repo   | Files | WITH (combo)| WITHOUT     | Verdict (pre → post)     |
|--------|-------|-------------|-------------|--------------------------|
| cobra  | ~50   | $0.25       | $0.31       | loss → **WIN** (-19%)    |
| ky     | ~25   | $0.39       | $0.39       | -42% → tied              |
| slim   | ~80   | $0.31       | $0.24       | LOSS 31% → still LOSS    |
| sinatra| ~60   | $0.30       | $0.23       | LOSS 18% → still LOSS    |

sinatra/slim remain a cost-loss because their WITHOUT path is
structurally cheap (~$0.20 — fewer than 4 cheap grep+read calls).
Codegraph can't beat that absolute floor with any meaningful response.
Both still WIN on time + reads + tool-call count.

Tests: tier boundary cases updated to cover the new <150 / 150-499 /
500-4999 / 5000-14999 / >=15000 progression. Off-by-one guard updated
to include the new 149↔150 boundary. All 1076 tests pass.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
On a <150-file project the entire repo is grep-able in one turn, so the
20-node default `codegraph_context` was paying for a graph subset that
exceeds the agent's actual question. Cutting the tiny-repo default to 8
(typical 1-3 entry points + their immediate 1-hop neighbors) reduces
the context-tool response body without hitting sufficiency on the flow
shapes small repos actually contain.

Non-breaking: the agent can still pass an explicit `maxNodes` to
override; medium+ repos (>=150 files) keep the 20-node default.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
n=2 audit on cobra/ky/sinatra ruled out cutting below 5 tools (search +
context + node + explore + trace) on the tiny-repo tier. The smaller
3-tool gate (search + context + trace) saved ~$0.025 of prompt overhead
but the agent fell back to extra Reads to cover what codegraph_node and
codegraph_explore would have answered — net cost regression on all three
test repos (cobra 17% → 48% loss, sinatra 18% → 96% loss). Documented
inline so future tuners don't re-try this dead-end.

No behavior change beyond the comment: the 5-tool gate remains the
production setting.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tested the hypothesis that exposing FEWER tools on micro repos (<50
files) would close the cost gap. Results:

- 1-tool gate (codegraph_search only):
  - ky:    +44% (worse than 5-tool +30%)
  - express: +107% (catastrophic — was -43% WIN with all 10)
  - cobra: +126% (way worse than 5-tool +17%)

The single-tool gate forces the agent to read everything because it
can't navigate the call graph. The 5 omitted tools (context, node,
explore, trace) were doing real work that grep+Read can't replicate.

Conclusion: 5 tools (search + context + node + explore + trace) is the
empirical lower bound on the tiny-repo tier. Cutting below regresses
EVERY tested repo. The remaining ~$0.04-0.08 of structural cost overhead
on tiny repos is unavoidable without sacrificing the value codegraph
provides at that scale (which would also make WITH = WITHOUT, defeating
the install).

Comment documents the dead-ends so future tuners don't relitigate.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… in context, hard-exclude low-value files

Three layered changes targeting the sinatra/slim/small-repo cost gap
that iter2's body-shrink failed to close (smaller bodies just pushed
the agent to Read instead):

1. **Tool-gate threshold 150 → 500** (`TINY_REPO_FILE_THRESHOLD`).
   Sinatra (~159 files) and slim (~200 files) have the same structural
   problem as cobra (
…siblings in search ranking

On projects with a single file holding the dense majority of internal
call edges (e.g. sinatra's `lib/sinatra/base.rb` at ~85% of in-file
edges), text search was favoring small focused extension files over the
core file. A small focused file like `multi_route.rb` wins on verbatim
name match + file-size normalization, burying the 1500-line core file's
longer method names (e.g. `route!` vs `route`).

Fix: detect the "dominant file" — the file whose in-file edge count is
≥3× the next candidate's — then add +25 to all results sharing its
directory prefix. This pulls the core file's siblings above
sibling-package extensions without hardcoding any repo structure.

`getDominantFile()` excludes test/spec files and generated files
(e.g. etcd's `rpc.pb.go` has 4× the in-file edges of `server.go` and
would otherwise hijack the boost toward generated protobuf stubs).
SQL pulls the top 20 candidates; path-pattern filtering handles what
SQLite LIKE can't express.
On small projects (<500 files) with a routing-shaped query, build a
URL→handler manifest directly from the graph (each `route` node joins to
its handler via `references`/`calls` edges) and inline the top handler
file's source. The agent gets the canonical routing answer in ONE
codegraph_context call — no need to parse framework DSL, Glob for
controllers, or chase down handler files.

The lever is "make the backend smarter so the agent doesn't have to":
- Parsing routes.rb / routes/api.php / urls.py DSL is the agent's job
  in the WITHOUT arm. Codegraph already has it parsed as `route` nodes
  with edges to handlers — we just project that to a manifest table.
- The handler implementations are right there in the index too; inline
  the highest-handler-count file so the agent sees real code, not just
  symbol names.

Results on the realworld template repos that were losing badly:
  rails-rw  +89% LOSS → -15% WIN  (agent often answers with 0-1 tool calls)
  laravel-rw  +29% LOSS → +12% (tight gap)
  gin-rw    +30% LOSS → +23% (still loss but smaller)
  flask-mb  +64% LOSS → +25% (smaller gap)

The residual losses are mostly the agent's defensive read behavior on
super-cheap-WITHOUT repos (express-rw still does 4 Reads even with a
19-row manifest + service file inlined). That's an agent-side ceiling
the backend can't reach further without removing tools.

Also lands `scripts/agent-eval/probe-sweep.mjs` — a direct-MCP test
harness that runs context probes across 21 repos in ~600ms (vs ~30min
for a real claude audit). Enables rapid iteration on backend changes:
edit tools.ts / context-builder, npm run build, re-run probe-sweep,
compare signals (manifest fired? handler file inlined? response size?)
before paying for a claude run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…eted files)

`MCPEngine.catchUpSync()` reconciles the index against the working tree
after open (catching `git pull`/`checkout`/`rebase` and any edits or
deletes made while no server was running). It was fire-and-forget — so a
tool call landing in the first ~50-300ms could race past it and serve
rows for files that no longer exist on disk. The per-file staleness
banner can't help here, because that signal is populated by the file
watcher (not by catch-up).

The fix: `catchUpSync()` now pushes its promise into `ToolHandler` via
`setCatchUpGate(p)`; the first `execute()` call awaits the gate and then
clears it. Subsequent calls pay nothing. Catch-up rejections are logged
by the engine and swallowed by the handler so a transient sync failure
never breaks tools.

Most visible on the "deleted everything between sessions" case, where
MCP previously returned stale rows pointing at non-existent files.
Validated end-to-end on a 10,640-file VS Code index: with the gate, a
codegraph_search for "ExtensionHost" against an empty (but stale-DB)
directory returns "No results found" after the catch-up drains the DB;
without the gate, the same call returns 10 stale hits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ce-override expansion

Add entries for work that landed on this branch but wasn't yet in
[Unreleased]: tiny-repo tool gating + sufficiency steering + budget
tier, auto-inline trace in codegraph_context, routing manifest inline,
core-directory ranking boost, JVM-only interfaceOverrideEdges extended
to C#/TS/JS/Swift/Scala, and the shorter tool descriptions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@colbymchenry colbymchenry changed the title feat(go): generated-file down-rank + gRPC stub-impl bridge + trace-failure inlining feat(mcp): multi-module Go trace-quality + small-repo retrieval tuning May 28, 2026
@colbymchenry colbymchenry merged commit 71935e3 into main May 28, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant